Reduce output buffer sizes for pruned pages of columns with a list parent #20086
…. Currently has a bug where if chunking actually occurs, things get messed up. In other words, it works with the chunked reader, but only if chunking does not occur.
…n the new flow of the code, causing issues during chunked reads.
…ber of variables being set in many disparate places that got unordered by the code rearrangement. We just have a whole lot of state that's tricky to keep track of these days.
…nate non-empty nulls
return std::pair{std::move(table), std::move(buffer)};
}

/**
All this simply moved as is from hybrid_scan_test.cpp. No need to review
return cudf::test::strings_column_wrapper(elements, elements + num_ordered_rows);
}

std::unique_ptr<cudf::table> concatenate_tables(std::vector<std::unique_ptr<cudf::table>> tables,
All this simply moved as is from hybrid_scan_test.cpp. No need to review
* @param page_mask Page mask indicating if this column needs to be decoded
* @param min_rows crop all rows below min_row
* @param num_rows Maximum number of rows to read
* other settings and records the result in the PageInfo::str_bytes_all field
stale comments
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
pre-commit.ci autofix
/ok to test a471b6e
/merge
Follow-up to #20086 and #19986. This PR enables skipping decompression of parquet data pages marked as pruned in the new experimental parquet reader. It also zeros out nesting size information (used to allocate output buffers) for pruned pages right when it is computed, instead of resetting it later, just before buffer allocation, as in #20086.
Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
Approvers:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)
URL: #20192
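To illustrate the idea of zeroing out nesting size information for pruned pages at the point where it is computed, here is a minimal, self-contained sketch. It is not the libcudf implementation: `PageSketch`, `compute_buffer_sizes`, and the per-depth layout are hypothetical names and simplifications introduced only for this example. It also assumes that a pruned page still contributes its top-level rows (materialized as nulls) while contributing nothing to the child buffers at deeper nesting levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for per-page metadata.
// This is NOT cudf's PageInfo; it only models element counts per nesting depth.
struct PageSketch {
  // values_per_depth[0] is the number of top-level rows in the page;
  // values_per_depth[d] for d > 0 is the element count at list depth d.
  std::vector<std::size_t> values_per_depth;
};

// Accumulates the number of elements to allocate at each nesting depth.
// page_mask[p] == true means page p must be decoded; false means it is pruned.
// Assumption: a pruned page keeps its top-level rows (they become nulls) but
// contributes zero elements to every deeper level, so no child storage is
// reserved for rows that end up null.
std::vector<std::size_t> compute_buffer_sizes(std::vector<PageSketch> const& pages,
                                              std::vector<bool> const& page_mask,
                                              std::size_t num_depths)
{
  std::vector<std::size_t> sizes(num_depths, 0);
  for (std::size_t p = 0; p < pages.size(); ++p) {
    bool const pruned = !page_mask[p];
    for (std::size_t d = 0; d < num_depths; ++d) {
      // Zero out the nested sizes of pruned pages while they are being summed,
      // rather than patching the totals later, just before allocation.
      std::size_t const count = (pruned && d > 0) ? 0 : pages[p].values_per_depth[d];
      sizes[d] += count;
    }
  }
  return sizes;
}
```

With three pages of 4, 3, and 5 rows holding 10, 7, and 12 child elements, and the middle page pruned, this sketch reserves 12 rows at depth 0 but only 22 child elements instead of 29.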
Description
Follow-up to #19986.
This PR reduces the output column buffer sizes needed to materialize columns with a list parent, such as list<list<...>>, list<str>, list<list<..<str>..>>, etc., against pruned parquet pages in the next-gen reader. By doing so, we also eliminate non-empty nulls across list hierarchies, speeding up their materialization.
Checklist